CENTRAL LIMIT THEOREMS FOR SOME SET PARTITION STATISTICS

BOBBIE CHERN, PERSI DIACONIS, DANIEL M. KANE, AND ROBERT C. RHOADES 


Abstract. We prove the conjectured limiting normality for the number of crossings of a 
uniformly chosen set partition of [n] = {1, 2,..., n}. The arguments use a novel stochastic 
representation and are also used to prove central limit theorems for the dimension index 
and the number of levels. 


1. Introduction 

Let $\lambda$ be a partition of the set [n] = {1, 2, ..., n}, so 1|2|3, 12|3, 13|2, 1|23, 123 are the five
partitions of [3]. The enumerative theory of “supercharacters” leads to the statistics

(1.1) $d(\lambda) = \sum_i (M_i - m_i + 1)$ and $cr(\lambda) = \#$ of crossings of $\lambda$.

In $d(\lambda)$, the sum is over the blocks of $\lambda$ and $M_i$ ($m_i$) is the largest (smallest) element of
block $i$. The statistic $cr(\lambda)$ counts $i < i' < j < j'$ with $i, j$ adjacent elements of the
same block and $i', j'$ adjacent elements of the same block. In a companion paper
[3] the moments of $d(\lambda)$ and $cr(\lambda)$ are determined as explicit linear combinations of Bell
numbers $B_n$. Numerical computations (see Figures 1 and 2) suggest that, normalized by
their mean and variance, these statistics have approximate normal distributions. Figures 1-3
are based on exact counts from our new algorithms [3]. Figures 1 and 2 suggest good
agreement with the normal approximation for dimension index and crossings. Figure 3 shows
slower convergence for levels and suggests a search for finite sample correction terms. We
found the limiting normality challenging to prove using available techniques (e.g. moments,
Fristedt’s method of conditioned limit theorems [9], or Stein’s method). Indeed, the
limiting normality of $cr(\lambda)$ is conjectured in [13].
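
The two statistics in (1.1) can be computed directly from a list of blocks. The following is a minimal sketch, not the exact-count algorithm of [3]; the function names `dimension_index`, `arcs` and `crossings` are hypothetical, and the crossing count is a brute-force comparison of all pairs of arcs.

```python
from itertools import combinations

def dimension_index(blocks):
    """d(lambda) = sum over blocks of (max - min + 1), as in (1.1)."""
    return sum(max(b) - min(b) + 1 for b in blocks)

def arcs(blocks):
    """Arcs (i, j) joining adjacent elements of the same block."""
    result = []
    for b in blocks:
        s = sorted(b)
        result.extend(zip(s, s[1:]))
    return result

def crossings(blocks):
    """cr(lambda): number of pairs of arcs (i, j), (i2, j2) with i < i2 < j < j2."""
    count = 0
    # Sorting by left endpoint guarantees i < i2 within each pair
    # (left endpoints of distinct arcs are distinct).
    for (i, j), (i2, j2) in combinations(sorted(arcs(blocks)), 2):
        if i < i2 < j < j2:
            count += 1
    return count

example = [{1, 3, 5}, {2, 4}, {6}, {7}]  # the partition 135|24|6|7
print(dimension_index(example), crossings(example))  # prints "10 2"
```

For 135|24|6|7 the arcs are (1,3), (3,5) and (2,4); the arc (2,4) crosses both of the others, giving two crossings.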

A key ingredient of the present paper is a stochastic algorithm for generating a random
set partition due to Stam [24]. Supplementing this with some novel probabilistic ideas allows
standard “delta method” techniques to finish the job.

Brief reviews of the extensive enumerative, algebraic and probabilistic aspects of set
partitions are in [16] and [25]. The book of Mansour [18] contains applications to computer
science and much else. An important paper combining many of the statistics we work with
is [2]. The companion paper [3] has an extensive review. It also summarizes the literature on
supercharacters. Briefly, these are natural characters $\chi_\lambda$ on the uni-upper triangular matrix
group $U_n(\mathbb{F}_q)$ which are indexed by set partitions. The representation corresponding to $\chi_\lambda$
has dimension $q^{d(\lambda)}$. The (usual) inner product between $\chi_\lambda$ and $\chi_\mu$ is
$\langle \chi_\lambda, \chi_\mu \rangle = \delta_{\lambda\mu}\, q^{cr(\lambda)}$.
This suggests understanding how $d(\lambda)$ and $cr(\lambda)$ vary for typical set partitions.

Figure 1. Histogram of the dimension exponent counts for n = 100 and the
associated Q-Q plot.

Figure 2. Histogram of the crossing number counts for n = 100 and the
associated Q-Q plot.

Figure 3. Histogram of the level counts for n = 100 and the associated Q-Q plot.

There are many codings of a set partition. One needed below codes $\lambda$ as a sequence
$a_1, a_2, \ldots, a_n$ with $a_i = j$ if and only if i is in block j of $\lambda$ (blocks numbered in order of
their smallest elements). Thus 135|24|6|7 corresponds to 1, 2, 1, 2, 1, 3, 4. If $a_i' = a_i - 1$,
then $a_1', a_2', \ldots, a_n'$ is a restricted growth sequence: $a_1' = 0$ and
$a_{i+1}' \le 1 + \max(a_1', \ldots, a_i')$ for $1 \le i \le n - 1$. This standard coding is discussed in [16, page
416]. For this coding, let

(1.2) $L(\lambda) = |\{i : a_{i+1} = a_i\}|$

the number of levels of $\lambda$. This is used as an example of the present techniques. See [18,
Chapter 4] for further references.
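
The coding and the level statistic can be illustrated as follows. This is a minimal sketch under the assumption that blocks are numbered by increasing smallest element; the helper names `block_code`, `is_restricted_growth` and `levels` are hypothetical.

```python
def block_code(blocks, n):
    """Return a_1..a_n with a_i = j iff i lies in block j (blocks ordered by minimum)."""
    ordered = sorted(blocks, key=min)
    code = [0] * n
    for j, b in enumerate(ordered, start=1):
        for i in b:
            code[i - 1] = j
    return code

def is_restricted_growth(seq):
    """Check seq[0] == 0 and seq[i+1] <= 1 + max(seq[0..i])."""
    if not seq or seq[0] != 0:
        return False
    running_max = seq[0]
    for x in seq[1:]:
        if x > running_max + 1:
            return False
        running_max = max(running_max, x)
    return True

def levels(code):
    """Number of positions i with code[i+1] == code[i], as in (1.2)."""
    return sum(1 for x, y in zip(code, code[1:]) if x == y)

code = block_code([{1, 3, 5}, {2, 4}, {6}, {7}], 7)
print(code)  # prints [1, 2, 1, 2, 1, 3, 4]
```

Subtracting 1 from each entry of `code` gives the restricted growth sequence 0, 1, 0, 1, 0, 2, 3, and this partition has no levels.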

The main theorems proved use the positive real solution $\alpha_n$ of $ue^u = n + 1$ (so $\alpha_n =
\log(n) - \log\log(n) + o(1)$ [8]). Let $\Pi(n)$ be the set of partitions of [n]. Throughout, $\lambda$ is
uniformly chosen in $\Pi(n)$.
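
For numerical work the solution of $ue^u = n + 1$ is easy to compute. The sketch below uses Newton's method; the function name `alpha`, the starting guess and the tolerance are illustrative assumptions, not part of the paper.

```python
import math

def alpha(n, tol=1e-12):
    """Positive solution u of u * exp(u) = n + 1, by Newton's method."""
    target = n + 1
    # Start at or to the right of the root; u * exp(u) is convex and
    # increasing for u > 0, so the Newton iterates decrease to the root.
    u = math.log(target) if target > math.e else 1.0
    for _ in range(100):
        f = u * math.exp(u) - target
        step = f / (math.exp(u) * (u + 1))
        u -= step
        if abs(step) < tol:
            break
    return u

for n in (10, 100, 10**6):
    print(n, alpha(n), math.log(n) - math.log(math.log(n)))
```

The second printed column can be compared with $\log(n) - \log\log(n)$ in the third; the difference is the $o(1)$ term, which decays slowly.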

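Theorem 1.1. The number of levels $L(\lambda)$ has $\mu_n^L = E(L(\lambda)) = (n-1)\frac{B_{n-1}}{B_n} \sim \log(n)$ and

$(\sigma_n^L)^2 = VAR(L(\lambda)) = (n-1)\frac{B_{n-1}}{B_n} + (n-1)(n-2)\frac{B_{n-2}}{B_n} - (n-1)^2\frac{B_{n-1}^2}{B_n^2} \sim \log(n)$. Normalized by
its mean and standard deviation, $L(\lambda)$ has an approximate standard normal distribution

$$P\left\{ \frac{L(\lambda) - \mu_n^L}{\sigma_n^L} \le x \right\} = \Phi(x) + o(1), \qquad \Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-u^2/2}\,du,$$

for all fixed x as $n \to \infty$.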

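Theorem 1.2. The dimension index $d(\lambda)$ has $\mu_n^d = E(d(\lambda)) = \frac{\alpha_n - 2}{\alpha_n^2}\, n^2 + O\left(\frac{n}{\alpha_n}\right)$ and

$(\sigma_n^d)^2 = VAR(d(\lambda)) = \frac{\alpha_n^2 - 7\alpha_n + 17}{\alpha_n^3(\alpha_n + 1)}\, n^3 + O\left(\frac{n^2}{\alpha_n}\right)$. Normalized by its mean and standard deviation, $d(\lambda)$

has an approximate standard normal distribution

$$P\left\{ \frac{d(\lambda) - \mu_n^d}{\sigma_n^d} \le x \right\} = \Phi(x) + o(1)$$

for all fixed x as $n \to \infty$.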

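Theorem 1.3. The number of crossings $cr(\lambda)$ has $\mu_n^{cr} = E(cr(\lambda)) = \frac{2\alpha_n - 5}{4\alpha_n^2}\, n^2 + O\left(\frac{n}{\alpha_n}\right)$

and $(\sigma_n^{cr})^2 = VAR(cr(\lambda)) = \frac{3\alpha_n^2 - 22\alpha_n + 56}{9\alpha_n^3(\alpha_n + 1)}\, n^3 + O\left(\frac{n^2}{\alpha_n}\right)$. Normalized by its mean and standard

deviation, $cr(\lambda)$ has an approximate standard normal distribution

$$P\left\{ \frac{cr(\lambda) - \mu_n^{cr}}{\sigma_n^{cr}} \le x \right\} = \Phi(x) + o(1)$$

for all fixed x as $n \to \infty$.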


Section 2 of this paper explains Stam’s algorithm and shows how it gives a useful heuristic
picture of what a random set partition “looks like”. The limit theorem for levels is proved in
Section 3 as a simple illustration of our proof technique. The dimension index and number
of crossings require further ideas. They are given separate proofs in Sections 4 and 5.



















Notation 

Throughout, we use the stochastic order symbols $O_p$ and $o_p$. If $X_n$, for $1 \le n < \infty$, is
a sequence of real valued random variables and $a_n$ is a sequence of real numbers, write
$X_n = O_p(a_n)$ if for every $\epsilon > 0$ and some $\eta > 0$, which may depend on $\epsilon$, there is N so that
$P\{|X_n| < \eta|a_n|\} > 1 - \epsilon$ for all $n > N$. Write $X_n = o_p(a_n)$ if for every $\epsilon > 0$ and $\eta > 0$ there
is N so that $P\{|X_n| < \eta|a_n|\} > 1 - \epsilon$ for all $n > N$. For background, examples and many
variations see Pratt [21], Lehmann [17], or Serfling [22]. We say two sequences of random
variables are weak star close if their distributions are close in Lévy metric.


2. Stam’s Algorithm and Set Partition Heuristics


Write $\Pi(n)$ for the set partitions of [n] = {1, 2, ..., n} and $B_n = |\Pi(n)|$ for the nth Bell
number (sequence A000110 of Sloane’s [23]). To help evaluate asymptotics it is helpful to
have expansions of the ratios $B_{n+1}/B_n$ and $B_{n+k}/B_n$ in terms of $\alpha_n$; to leading order,

$$\frac{B_{n+k}}{B_n} \sim \left(\frac{n}{\alpha_n}\right)^k,$$

which is valid for fixed k as $n \to \infty$. See, for instance, [2]. Dobinski’s identity [7]

(2.1) $B_n = \frac{1}{e} \sum_{m=1}^{\infty} \frac{m^n}{m!}$

shows that for fixed $n \in \{1, 2, \ldots\}$

(2.2) $\mu_n(m) = \frac{1}{eB_n} \frac{m^n}{m!}$

is a probability measure on {1, 2, 3, ...}. Stam [24] uses this measure to give an elegant
algorithm for choosing a uniform random element of $\Pi(n)$.
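
Dobinski's identity and the resulting measure can be checked numerically. The sketch below computes Bell numbers with the Bell triangle and evaluates the weights $m^n/(e B_n m!)$ in floating point through logarithms; the names `bell_numbers` and `mu_pmf` and the truncation point 200 are illustrative assumptions.

```python
import math

def bell_numbers(n_max):
    """Return [B_0, ..., B_{n_max}] using the Bell triangle."""
    bells = [1]
    row = [1]
    for _ in range(n_max):
        new_row = [row[-1]]          # each row starts with the last entry of the previous row
        for x in row:
            new_row.append(new_row[-1] + x)
        row = new_row
        bells.append(row[0])
    return bells

def mu_pmf(n, m_max):
    """Weights m^n / (e * B_n * m!) for m = 1..m_max, computed via logs."""
    log_bn = math.log(bell_numbers(n)[n])
    return {
        m: math.exp(n * math.log(m) - math.lgamma(m + 1) - 1.0 - log_bn)
        for m in range(1, m_max + 1)
    }

pmf = mu_pmf(20, 200)               # the tail beyond m = 200 is negligible for n = 20
total = sum(pmf.values())
mean = sum(m * p for m, p in pmf.items())
bells = bell_numbers(21)
print(total, mean, bells[21] / bells[20])
```

The weights sum to 1 up to rounding, and their mean agrees with $B_{n+1}/B_n$, as Dobinski's identity predicts.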

Stam’s Algorithm 

(1) Choose M from $\mu_n$.

(2) Drop n labelled balls uniformly into M boxes. 

(3) Form a set partition A of [n] with i and j in the same block if and only if balls i and 
j are in the same box. 
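
The three steps above can be sketched as follows. This is an illustration rather than Stam's own implementation: M is drawn by inverse-CDF sampling from weights proportional to $m^n/m!$ on a truncated range (the cutoff `2 * n + 20` is an assumption chosen so that the neglected tail is negligible), and the names `sample_m` and `stam_partition` are hypothetical.

```python
import math
import random

def sample_m(n, rng):
    """Draw M with P(M = m) proportional to m^n / m! on a truncated range."""
    m_max = 2 * n + 20               # assumed cutoff; the mass beyond it is negligible
    log_w = [n * math.log(m) - math.lgamma(m + 1) for m in range(1, m_max + 1)]
    top = max(log_w)                 # subtract the maximum to avoid overflow
    weights = [math.exp(x - top) for x in log_w]
    return rng.choices(range(1, m_max + 1), weights=weights)[0]

def stam_partition(n, rng=None):
    """Steps (1)-(3): choose M, drop n balls into M boxes, read off the blocks."""
    rng = rng or random.Random()
    m = sample_m(n, rng)
    boxes = {}
    for ball in range(1, n + 1):
        boxes.setdefault(rng.randrange(m), []).append(ball)
    # Empty boxes are simply never created; blocks are listed by smallest element.
    return sorted(boxes.values(), key=min)

print(stam_partition(10, random.Random(0)))
```

Drawing many samples for n = 3 and tabulating them should give each of the five partitions of [3] with frequency close to 1/5.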

Of course, after choosing M and dropping balls, some of the boxes may be empty. Stam
shows that the number of empty boxes has (exactly) a Poisson distribution and is
independent of the generated set partition. This implies that the number of boxes M drawn
from $\mu_n$ at (2.2) has the same limiting distribution as the number of blocks in a random
$\lambda \in \Pi(n)$. This is a well studied random variable. It will emerge that the fluctuations of M
are the main source of randomness in Theorems 1.1-1.3. Results of Hwang [11] prove the
following normal limit theorem (Hwang also has an error estimate).
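
The relation between M and the number of blocks can be checked exactly for small n. If the number of empty boxes is Poisson with mean 1 and independent of the partition, then the mean and the variance of M each exceed those of the number of blocks by exactly 1. The sketch below compares brute-force block counts with the moments of M obtained from Dobinski's identity, $E(M) = B_{n+1}/B_n$ and $E(M^2) = B_{n+2}/B_n$; all function names are hypothetical.

```python
from fractions import Fraction

def set_partitions(n):
    """Yield every set partition of {1, ..., n} as a list of blocks."""
    def extend(k, blocks):
        if k > n:
            yield [list(b) for b in blocks]
            return
        for b in blocks:             # put k into an existing block
            b.append(k)
            yield from extend(k + 1, blocks)
            b.pop()
        blocks.append([k])           # or open a new block with k
        yield from extend(k + 1, blocks)
        blocks.pop()
    yield from extend(1, [])

def bell(n):
    return sum(1 for _ in set_partitions(n))

def block_count_moments(n):
    """Exact mean and variance of the number of blocks of a uniform partition."""
    counts = [len(p) for p in set_partitions(n)]
    mean = Fraction(sum(counts), len(counts))
    var = Fraction(sum(c * c for c in counts), len(counts)) - mean ** 2
    return mean, var

def box_count_moments(n):
    """Exact mean and variance of M under mu_n, from ratios of Bell numbers."""
    b0, b1, b2 = bell(n), bell(n + 1), bell(n + 2)
    mean = Fraction(b1, b0)
    return mean, Fraction(b2, b0) - mean ** 2

for n in range(1, 6):
    print(n, block_count_moments(n), box_count_moments(n))
```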

Theorem 2.1. For M chosen from $\mu_n$ of (2.2), as $n \to \infty$

$$\mu_n^M := E(M) = \frac{B_{n+1}}{B_n} \sim \frac{n}{\alpha_n}$$

and

$$(\sigma_n^M)^2 := VAR(M) = \frac{B_{n+2}}{B_n} - \frac{B_{n+1}^2}{B_n^2} \sim \frac{n}{\alpha_n^2}.$$

Normalized by its mean and standard deviation, M has an approximate standard normal
distribution.


Heuristic I. Stam’s algorithm gives a useful intuitive way to think about a random element
of $\Pi(n)$. It behaves practically the same as a uniform multinomial allocation of n labelled
balls into $m = n/\log(n)$ boxes. The arguments in the following sections make this precise. It
appears to us that many of the features previously treated in the beautiful paper of Fristedt
[9] can be treated by the present approach. Note that Fristedt treated features that only
depend on block sizes (largest, smallest, number of boxes of size i). None of our statistics
have this form.

Heuristic II. Fristedt’s arguments randomize n. This makes the block variables, $N_i(\lambda)$ =
# blocks of size i, independent, allowing standard probability theorems to be used. At the
end, a Tauberian argument (dePoissonization) is used to show that the theorems hold for
fixed n. The present argument fixes n and randomizes the number of blocks. This results in
a “balls in boxes” problem with many tools available. At the end, an Abelian argument shows 
that the appropriate limit theorem holds when m fluctuates. See [1^ for background on this 
use of Abelian and Tauberian theorems. There are many variants of Poissonization in active 
use. We do not see how to abstract Stam’s algorithm to other combinatorial structures. 

We conclude this section with a simple illustration of Stam’s algorithm. From (2.2),
$\mu_n(m) = \frac{1}{eB_n}\frac{m^n}{m!}$ is a probability measure on {1, 2, 3, ...}. Thus for $-n < d < \infty$

(2.3) $E_n(M^d) = \frac{1}{eB_n} \sum_{m=1}^{\infty} \frac{m^{n+d}}{m!} = \frac{B_{n+d}}{B_n}$.

Let us apply this to compute the moments for $L(\lambda)$, the number of levels of $\lambda \in \Pi(n)$. From
the definition (1.2), given M, $L(\lambda) = X_1 + \cdots + X_{n-1}$ where $X_i$ is the indicator random
variable of the event that balls i and i + 1 are dropped into the same box. By inspection,
the $X_i$ are independent with $P(X_i = 1) = \frac{1}{M}$. Thus

(2.4) $E_n(L(\lambda)) = E_n E(L(\lambda) \mid M) = E_n\left(\frac{n-1}{M}\right) = (n-1)\frac{B_{n-1}}{B_n}$.

The standard identity 


VAR(Z) = E(VAR(Z|W)) + VAR(E(Z|W)) 


for any random variables Z and W such that the moments exist, shows that 

(2.5) $VAR_n(L(\lambda)) = (n-1)\frac{B_{n-1}}{B_n} + (n-1)(n-2)\frac{B_{n-2}}{B_n} - (n-1)^2\frac{B_{n-1}^2}{B_n^2}$.
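
The moment formulas for levels can be verified by brute force for small n. Carrying out the variance decomposition with $E(L \mid M) = (n-1)/M$ and $VAR(L \mid M) = (n-1)\frac{1}{M}(1 - \frac{1}{M})$ gives the coefficient $(n-1)(n-2)$ on $B_{n-2}/B_n$, and the sketch below checks the resulting mean and variance against a full enumeration of $\Pi(n)$ in exact rational arithmetic. The function names are hypothetical.

```python
from fractions import Fraction

def set_partitions(n):
    """Yield every set partition of {1, ..., n} as a list of blocks."""
    def extend(k, blocks):
        if k > n:
            yield [list(b) for b in blocks]
            return
        for b in blocks:
            b.append(k)
            yield from extend(k + 1, blocks)
            b.pop()
        blocks.append([k])
        yield from extend(k + 1, blocks)
        blocks.pop()
    yield from extend(1, [])

def levels(blocks):
    """Number of i such that i and i + 1 lie in the same block."""
    return sum(1 for b in blocks for x in b if x + 1 in b)

def bell(n):
    return sum(1 for _ in set_partitions(n))

def level_moments_enumerated(n):
    values = [levels(p) for p in set_partitions(n)]
    mean = Fraction(sum(values), len(values))
    var = Fraction(sum(v * v for v in values), len(values)) - mean ** 2
    return mean, var

def level_moments_bell(n):
    """Mean (n-1)B_{n-1}/B_n and the variance from the conditional decomposition."""
    b0, b1, b2 = bell(n), bell(n - 1), bell(n - 2)
    mean = Fraction((n - 1) * b1, b0)
    var = mean + Fraction((n - 1) * (n - 2) * b2, b0) - mean ** 2
    return mean, var

for n in range(2, 8):
    print(n, level_moments_enumerated(n), level_moments_bell(n))
```

For n = 3 both computations give mean 4/5 and variance 14/25.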

More generally, this provides an alternative approach to [3] for showing that the moments
of statistics $T(\lambda)$ are shifted Bell polynomials. It requires $E_n(T(\lambda) \mid m)$ to be a Laurent
polynomial in m. As an example, Stam worked with $W_i(\lambda)$, the size of the block in $\lambda$
containing i, $1 \le i \le n$. Then, any polynomial in the $W_i$ has expectation a shifted Bell

polynomial; for example, $W_i^2$ and $W_iW_j$. Stam proves that $W_i$ is approximately normal.


3. Proof of Theorem 1.1


Theorem 1.1 is proved here as a simple illustration of our technique. Conditioning on M
in Stam’s algorithm, classical “balls in bins” central limit theorems are used to prove the
limiting normality uniformly in M and standard $\delta$-method arguments are used to complete
the proof.


Proof of Theorem 1.1. The moments of the level statistic $L(\lambda)$ are computed in (2.4) and
(2.5). Conditional on M, $L(\lambda) = X_1 + \cdots + X_{n-1}$ with $X_i$ independent identically distributed
binary variables with $P(X_i = 1) = 1/M$. Thus conditioned on M,

$$E(L(\lambda) \mid M) = \frac{n-1}{M} \quad \text{and} \quad VAR(L(\lambda) \mid M) = \frac{n-1}{M}\left(1 - \frac{1}{M}\right)$$

and, normalized by its conditional mean and variance, $L(\lambda)$ has a standard normal limiting
distribution provided $n/M \to \infty$. In the present case, $M = M_n$ is a random variable. From
Theorem 2.1 as n tends to infinity


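(3.1) $\frac{M_n - \mu_n^M}{\sigma_n^M} \Rightarrow N(0, 1)$ with $\mu_n^M \sim \frac{n}{\alpha_n}$, $(\sigma_n^M)^2 \sim \frac{n}{\alpha_n^2}$.

This implies

(3.2) $\frac{n}{M_n} = \alpha_n + O_p\left(\frac{\alpha_n}{\sqrt{n}}\right)$.

To be precise, write $M_n = \mu_n^M + Z_n\sigma_n^M$ with $Z_n = (M_n - \mu_n^M)/\sigma_n^M$. Then

(3.3) $\frac{n}{M_n} = \frac{n}{\mu_n^M + Z_n\sigma_n^M} = \frac{n}{\mu_n^M} \cdot \frac{1}{1 + Z_n\sigma_n^M/\mu_n^M} = \frac{n}{\mu_n^M}\left(1 - \frac{\sigma_n^M}{\mu_n^M} Z_n + O_p\left(\left(\frac{\sigma_n^M}{\mu_n^M}\right)^2\right)\right)$.

From Theorem 2.1, $n/\mu_n^M = \alpha_n + O(\alpha_n/n)$ and $\sigma_n^M/\mu_n^M = O(1/\sqrt{n})$. Since $Z_n = O_p(1)$, (3.2)
follows.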

Thus, with probability close to 1 with respect to M we have that $L(\lambda)$ conditioned on M
is weak star close to a Gaussian with mean

$$\mu_n^L = \frac{n}{M} = \alpha_n + O_p\left(\frac{\alpha_n}{\sqrt{n}}\right)$$

and standard deviation

$$\sqrt{\frac{n}{M}\left(1 - \frac{1}{M}\right)} = \sqrt{\alpha_n} + o_p(1).$$

Thus, with high probability over M, the conditional distribution of $L(\lambda)$ is weak star close
to $N(\alpha_n, \sqrt{\alpha_n})$. Therefore, the overall distribution of $L(\lambda)$ is also close to this normal
distribution.

□


4. Proof of Theorem 1.2

In outline, the proof proceeds by choosing a random $\lambda \in \Pi(n)$ using Stam’s algorithm.
Conditioning on the chosen m reduces the problem to a slightly non-standard balls in boxes
problem. Given m, it is shown that $d(\lambda) = nm - 2m^2 + O_p(m^{3/2})$ so that the fluctuations
in $d(\lambda)$ are driven by the fluctuations in m. These are asymptotically normally distributed
with mean $\mu_n^M$ and variance $(\sigma_n^M)^2$. From Theorem 2.1 above, a simple averaging argument
completes the proof. The first proposition treats the balls in boxes argument. It proves more
than is needed. The argument is useful for statistics such as $T(\lambda) = \sum_i M_i$ where the sum
runs over the blocks of $\lambda$ indexed by i and $M_i$ is the maximum element in the ith block.

The first step in the proof is to prove the appropriate approximation conditional on m.
While it would be of interest to explore this for general n, m, we content ourselves with
proving what is needed for Theorem 1.2. From Theorem 2.1 the relevant values of m are
$n/\alpha_n + c\sqrt{n}/\log(n)$ for large fixed values of c. This explains the choice in the next lemma.

Lemma 4.1. Fix a large number C. Let n balls labeled 1, 2, ..., n be dropped uniformly at
random into m boxes with $m = \frac{n}{\alpha_n} + \frac{c\sqrt{n}}{\log(n)}$ for $|c| \le C$. Let

$$D_n = \sum_{i=1}^{m} (M_i - m_i + 1)$$

with $M_i$ the maximum label in box i and $m_i$ the minimum label of box i. $M_i - m_i + 1$ is omitted
if box i is empty. Then $D_n = nm - 2m^2 + O_p(m^{3/2})$ uniformly in $|c| \le C$.
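
The lemma can be explored by simulation. The sketch below computes $D_n$ for a given allocation of balls to boxes and averages it over seeded random allocations; the function names and the illustrative values n = 2000, m = 343 (roughly $n/\alpha_n$ for this n) are assumptions, and a simulation of this size only indicates the leading term $nm - 2m^2$.

```python
import random

def d_from_boxes(assignment):
    """D_n = sum over non-empty boxes of (max label - min label + 1).

    assignment[k] is the box receiving ball k + 1.
    """
    lo, hi = {}, {}
    for label, box in enumerate(assignment, start=1):
        lo.setdefault(box, label)    # first ball seen in a box is its minimum
        hi[box] = label              # last ball seen in a box is its maximum
    return sum(hi[b] - lo[b] + 1 for b in lo)

def simulate(n, m, trials, seed=0):
    """Average D_n over independent uniform allocations of n balls to m boxes."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        total += d_from_boxes([rng.randrange(m) for _ in range(n)])
    return total / trials

n, m = 2000, 343
average = simulate(n, m, trials=20)
print(average, n * m - 2 * m * m)
```

The allocation 0, 1, 0, 1, 0, 2, 3 corresponds to the partition 135|24|6|7 and gives $D_n = 10$.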


Proof. Consider an infinite supply of balls labelled 1, 2, 3, ... dropped uniformly at random
into m boxes. Let $W_i$, $1 \le i \le m$, be the waiting time until i boxes have been filled.
Thus $W_1 = 1$, $W_2 - W_1$ is GEOMETRIC(1/m), $W_3 - W_2$ is GEOMETRIC(2/m), ...,
$W_m - W_{m-1}$ is GEOMETRIC((m - 1)/m) and all these differences are independent. Here,
if X is GEOMETRIC($\theta$), $P(X = j) = \theta^{j-1}(1 - \theta)$, $E(X) = \frac{1}{1-\theta}$, and $VAR(X) = \frac{1}{1-\theta}\left(\frac{1}{1-\theta} - 1\right)$.
Let $E_t$ be the number of empty boxes at time t and $L_t$ be the largest i so that $W_i \le t$. If
$W_m \le t$ all boxes are non-empty at time t and $E_t = 0$. More generally, $L_t = m - E_t$.

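The sum of the box minima is $\sum_i m_i = W_1 + \cdots + W_{L_n}$. This sum may be controlled by showing that $E_n$
is bounded with high probability and then bounding the sum by Chebychev bounds. The
same argument works for $\sum_i M_i$. Toward this end, represent

$$E_n = \sum_{i=1}^{m} X_i \quad \text{where} \quad X_i = \begin{cases} 1 & \text{box } i \text{ is empty after } n \text{ balls} \\ 0 & \text{box } i \text{ is not empty after } n \text{ balls.} \end{cases}$$

Then

$$E(E_n) = m\left(1 - \frac{1}{m}\right)^n, \qquad VAR(E_n) = m\left(1 - \frac{1}{m}\right)^n + m(m-1)\left(1 - \frac{2}{m}\right)^n - m^2\left(1 - \frac{1}{m}\right)^{2n}.$$

By elementary estimates

(4.1) $E(E_n) = 1 + O\left(\frac{C\log(n)}{\sqrt{n}}\right)$, $\quad VAR(E_n) = 1 + O\left(\frac{C\log(n)}{\sqrt{n}}\right)$.

Indeed, $m(1 - 1/m)^n = \exp(\log(m) - n/m + O(n/m^2))$. Using the assumption $m = \frac{n}{\alpha_n} + \frac{c\sqrt{n}}{\log(n)}$, $\log(m) =
\alpha_n + O(C/\sqrt{n})$ and $n/m = \alpha_n + O(C\log(n)/\sqrt{n})$. This gives the first result in (4.1); the second follows
similarly. By classical results [1], $E_n$ is approximately POISSON(1) distributed with an
explicit total variation error but this is not needed.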

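Consider next

$$S_n = W_1 + \cdots + W_m = mW_1 + (m-1)(W_2 - W_1) + \cdots + 2(W_{m-1} - W_{m-2}) + (W_m - W_{m-1}).$$

(4.2) $E(S_n) = m + (m-1)\frac{m}{m-1} + \cdots + 2 \cdot \frac{m}{2} + \frac{m}{1} = m^2$

(4.3) $VAR(S_n) = \sum_{i=1}^{m-1} (m-i)^2 \frac{m}{m-i}\left(\frac{m}{m-i} - 1\right) = \sum_{i=1}^{m-1} mi = \frac{m^2(m-1)}{2}$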
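
The two moments of $S_n$ follow from the independence of the geometric increments and can be checked in exact arithmetic. In the sketch below the increment $W_{i+1} - W_i$ is treated as geometric with success probability $p_i = (m - i)/m$, so that it has mean $1/p_i$ and variance $(1 - p_i)/p_i^2$; the function name `waiting_sum_moments` is hypothetical.

```python
from fractions import Fraction

def waiting_sum_moments(m):
    """Exact mean and variance of S = W_1 + ... + W_m for m boxes.

    S = m * W_1 + sum_{i=1}^{m-1} (m - i) * G_i with W_1 = 1 and
    G_i = W_{i+1} - W_i independent geometric increments.
    """
    mean = Fraction(m)               # contribution of m * W_1
    var = Fraction(0)
    for i in range(1, m):
        p = Fraction(m - i, m)       # chance that the next ball fills a new box
        mean += (m - i) / p
        var += (m - i) ** 2 * (1 - p) / p ** 2
    return mean, var

for m in (2, 5, 10):
    print(m, waiting_sum_moments(m))
```

For every m the mean equals $m^2$ and the variance equals $m^2(m-1)/2$, which is of order $m^3/2$.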

Consider next the sum of the box maxima. Drop balls labelled n, n - 1, ..., 1 sequentially
into m boxes. If the new arrivals are at times $W_1', W_2', \ldots, W_m'$, the box maxima are
$n - (W_1' - 1), n - (W_2' - 1), \ldots, n - (W_m' - 1)$. The sum

$$S_n' = \sum_{i=1}^{m} M_i = nm - (W_1' + \cdots + W_m') + m.$$

Thus

(4.4) $E(S_n') = m(n+1) - m^2$

(4.5) $VAR(S_n') = \frac{m^2(m-1)}{2}$

The random variable of interest is

$$D_n = \sum_{i=1}^{L_n} (M_i - m_i + 1) = S_n' - S_n + m + \sum_{i=L_n+1}^{m} (W_i + W_i' - n - 2).$$

The sum $\sum_{i=L_n+1}^{m} W_i \le E_n W_m$. From the coupon collectors problem $W_m$ is of stochastic
order $m\log(m) \sim n$ and $E_n$ is stochastically bounded. A similar argument holds with $W_i$
replaced by $W_i'$. It follows that the sum $\sum_{i=L_n+1}^{m} (W_i + W_i' - n - 2) = O_p(n)$. Combining terms

$$D_n = S_n' - S_n + m + O_p(n) = nm - 2m^2 + (S_n' - E(S_n')) - (S_n - E(S_n)) + O_p(n).$$

By Chebychev’s inequality $|S_n - E(S_n)|$ and $|S_n' - E(S_n')|$ are both $O_p(m^{3/2})$. It follows that
$D_n = nm - 2m^2 + O_p(m^{3/2})$. □

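Proof of Theorem 1.2. To finish the proof of Theorem 1.2 note that conditional on M

$$d(\lambda) = nM - 2M^2 + O_p(M^{3/2}).$$

This is weak star close to

$$N\left(\frac{n^2}{\alpha_n} - \frac{2n^2}{\alpha_n^2}, \frac{n^{3/2}}{\alpha_n}\right) + O_p\left(\left(\frac{n}{\alpha_n}\right)^{3/2}\right).$$

Since $(n/\alpha_n)^{3/2}$ is much smaller than the standard deviation of the normal, this is in turn
close to

$$N\left(\frac{n^2}{\alpha_n} - \frac{2n^2}{\alpha_n^2}, \frac{n^{3/2}}{\alpha_n}\right).$$

This completes the proof. □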


5. Proof of Theorem 1.3

This section contains the proof of Theorem 1.3. Our approach is to compare the crossing
statistic to the dimension statistic, which by Theorem 1.2 is known to be normally
distributed.

Proof. To analyze the distribution of the crossing number, we compare it to the dimension
index. We do this by producing a uniform random set partition $\lambda$ in the following unusual
way:

• Pick M from $\mu_n$.

• Pick a uniform random set partition $\mu$ for that m according to Stam’s algorithm.

• Let $\lambda$ be a uniform random set partition conditional on the event that the set of
minimum elements of blocks of $\lambda$ is the set of minimum elements of blocks of $\mu$ and
that the set of maximum elements of blocks of $\lambda$ equals the set of maximum elements
of blocks of $\mu$.

This third step can be accomplished in the following way, assigning the elements of [n] to
blocks in order. We begin with no blocks and add elements to blocks one at a time, sometimes
creating new blocks. If an element k, where k is the maximum element of some block of
$\mu$, is added to a block in $\lambda$, we declare that block closed. After having assigned the first k
elements to blocks in $\lambda$, we assign k + 1 to a uniform random un-closed block, unless k + 1
is the minimum element of some block of $\mu$, in which case we assign k + 1 to a new block of
$\lambda$. This procedure clearly produces a uniform $\lambda$ subject to the restriction on the minimum
and maximum elements of blocks.
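
The sequential construction just described can be sketched as follows. This is an illustration under the assumption that the input blocks form a valid set partition of [n]; the function name `resample_with_same_extremes` and the returned dictionary of open-block counts are hypothetical conveniences.

```python
import random

def resample_with_same_extremes(mu_blocks, n, rng=None):
    """Build a partition of [n] whose sets of block minima and block maxima
    agree with those of mu_blocks, assigning 1, 2, ..., n in order."""
    rng = rng or random.Random()
    minima = {min(b) for b in mu_blocks}
    maxima = {max(b) for b in mu_blocks}
    open_blocks, closed_blocks = [], []
    open_counts = {}                 # number of open blocks when k arrives
    for k in range(1, n + 1):
        open_counts[k] = len(open_blocks)
        if k in minima:
            block = [k]              # k starts a new block
            open_blocks.append(block)
        else:
            block = rng.choice(open_blocks)   # uniform choice among un-closed blocks
            block.append(k)
        if k in maxima:
            open_blocks.remove(block)         # the block receiving a maximum is closed
            closed_blocks.append(block)
    return closed_blocks, open_counts

mu = [[1, 3, 5], [2, 4], [6], [7]]
lam, counts = resample_with_same_extremes(mu, 7, random.Random(1))
print(lam, counts)
```

The number of choices at each step depends only on how many blocks are open, which is determined by the two sets of extremes; this is why every admissible partition is produced with the same probability.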

On the other hand, this method of choosing $\lambda$ gives us a reasonable way to analyze $cr(\lambda)$.
In particular, the crossing number of $\lambda$ equals the number of pairs of a $j \in [n]$ and a block
B in $\lambda$ with

• $j \notin B$

• j not the first element of its block

• $\max(B) > j$

• The element of B immediately preceding j is larger than the element of j’s block
immediately preceding j
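
The equivalence between this description and the arc definition of a crossing can be confirmed by enumeration for small n. The sketch below counts crossings both ways over all set partitions of [n]; the function names are hypothetical, and the last condition is read as requiring that B has an element preceding j.

```python
from itertools import combinations

def set_partitions(n):
    """Yield every set partition of {1, ..., n} as a list of blocks."""
    def extend(k, blocks):
        if k > n:
            yield [list(b) for b in blocks]
            return
        for b in blocks:
            b.append(k)
            yield from extend(k + 1, blocks)
            b.pop()
        blocks.append([k])
        yield from extend(k + 1, blocks)
        blocks.pop()
    yield from extend(1, [])

def crossings_by_arcs(blocks):
    """Pairs of arcs (i, j), (i2, j2) between adjacent block elements with i < i2 < j < j2."""
    arcs = sorted(a for b in blocks for a in zip(sorted(b), sorted(b)[1:]))
    return sum(1 for (i, j), (i2, j2) in combinations(arcs, 2) if i < i2 < j < j2)

def crossings_by_pairs(blocks):
    """Pairs (j, B) satisfying the four listed conditions."""
    count = 0
    for own in blocks:
        for j in own:
            prev_own = [x for x in own if x < j]
            if not prev_own:
                continue             # j is the first element of its block
            for other in blocks:
                if other is own or max(other) <= j:
                    continue
                prev_other = [x for x in other if x < j]
                if prev_other and max(prev_other) > max(prev_own):
                    count += 1
    return count

mismatches = sum(
    1
    for n in range(1, 7)
    for p in set_partitions(n)
    if crossings_by_arcs(p) != crossings_by_pairs(p)
)
print(mismatches)  # prints 0
```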

We note that this is easy to analyze given the procedure above for choosing $\lambda$. Suppose that
when k is being added to $\lambda$ there are $a_k$ blocks of $\lambda$ currently open. If k is the first
element of its block, then we have no crossings with j = k. Otherwise, we claim that the
number of crossings with j = k (which we call $X_k$) has distribution given by the discrete
uniform random variable on $[0, a_k - 1]$. In particular, if the open blocks are $B_1, \ldots, B_{a_k}$
whose elements immediately preceding k are $m_1 < m_2 < \cdots < m_{a_k}$, then $X_k = a_k - i$ if k is
assigned to block $B_i$. Note furthermore that the $a_k$ are determined by $\mu$ and that the $X_k$
are independent conditional on $\mu$. Since $cr(\lambda) = \sum_k X_k$ is a sum of independent random
variables, it is easy to see that conditioned on $\mu$, with high probability $cr(\lambda)$ is weak
star close to

$$N\left(\sum_{k \text{ not a minimum}} \frac{a_k - 1}{2}, \sqrt{\sum_{k \text{ not a minimum}} \frac{a_k^2 - 1}{12}}\right).$$

We note that a given block contributes to $a_k$ if and only if k is between its minimum and
maximum values. Therefore,

$$\sum_{k=1}^{n} (a_k - 1) = \left(\sum_{i=1}^{m} (M_i - m_i)\right) - n = nm - 2m^2 + O_p(m^{3/2}).$$

On the other hand, the sum of $a_k$ over the k at the start of blocks is the number of pairs of blocks that
overlap. Note that for $m = n/\alpha_n + O_p(\sqrt{n}/\log(n))$, any given block has n/2 between
its minimum and maximum with probability $1 - O(\sqrt{\log(n)/n})$. Thus, for m in this range, the

expected number of pairs of non-overlapping blocks is $O(m^{3/2})$. Thus,

$$\sum_{k \text{ not a minimum}} \frac{a_k - 1}{2} = nm/2 - 5m^2/4 + O_p(m^{3/2}).$$

It is also easy to see that

$$\sum_{k \text{ not a minimum}} \frac{a_k^2 - 1}{12} = \frac{nm^2}{12}(1 + o_p(1)) = \frac{n^3}{12\alpha_n^2}(1 + o_p(1)).$$

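Therefore, with probability approaching 1 over the choice of m, the distribution of $cr(\lambda)$
conditioned on m is close to

$$N\left(nm/2 - 5m^2/4, \frac{n^{3/2}}{\sqrt{12}\,\alpha_n}\right).$$

This can be rewritten (up to small error) as the sum of $(n/2 - 5n/(2\alpha_n))(m - n/\alpha_n)$ and a
variable with distribution

$$N\left(\frac{n^2}{2\alpha_n} - \frac{5n^2}{4\alpha_n^2}, \frac{n^{3/2}}{\sqrt{12}\,\alpha_n}\right).$$

On the other hand, by Theorem 2.1, $(n/2 - 5n/(2\alpha_n))(m - n/\alpha_n)$ is approximated by an
independent normal weak star close to

$$N\left(0, \frac{n^{3/2}}{2\alpha_n}\right).$$

Thus, the distribution of $cr(\lambda)$ is close in cdf distance to this sum of independent normals,
which is given by

$$N\left(\frac{n^2}{2\alpha_n} - \frac{5n^2}{4\alpha_n^2}, \frac{n^{3/2}}{\sqrt{3}\,\alpha_n}\right).$$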

This completes the proof. 


□